As mentioned in the beginning of lecture, there are (at least) two basic learning setups. In supervised machine learning one observes a response \(Y\) along with \(p\) different features \(X=(X_1,X_2,\ldots,X_p)\), where we typically postulate the relationship \(Y = f(X)+\varepsilon\) with \(\varepsilon\) independent of \(X\) and mean zero. Here quality is usually assessed by the (test/out-of-sample) error that compares predictions and realizations on a separate dataset. In unsupervised learning, we only observe the \(p\) features \(X_1,X_2,\ldots,X_p\), and we would like to learn about their relationship – without focusing on a supervising outcome. Of course, the difficulty is how to assess quality in this case – so different unsupervised learning techniques are very different, and which one to pick will depend on the nature of the problem.
In this tutorial, we will take a closer look at Clustering. There are a variety of other unsupervised techniques, including anomaly detection, self-organizing maps, association analysis, etc.
As before, let’s start fresh and install the required packages, if necessary:
rm(list=ls())
#install.packages("maps")
#install.packages("mapproj")
#install.packages("NbClust")
Clustering refers to techniques for finding subgroups in a given dataset. The typical approach to determine clusters \(C_1,\ldots,C_K\) is to minimize: \[ \sum_{k=1}^K W(C_k), \] where \(W\) is a measure of variation within a cluster. For instance, k-means clustering uses the squared Euclidean distance to measure variation: \[ W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2. \] The minimization is carried out by a greedy algorithm that alternates between assigning each observation to the nearest cluster center (referred to as the centroid) and recomputing the centroids. The number of clusters \(K\) must be chosen beforehand. An alternative is hierarchical clustering, where one starts with a large number of clusters (in the agglomerative version, each observation is its own cluster) and then successively fuses clusters that are similar (e.g., with regard to the distance between their centroids).
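As a small aside (an illustrative sketch, not part of the analysis below), the pairwise form of \(W(C_k)\) above can be checked against the centroid-based quantity that kmeans() reports: summing squared pairwise distances within a cluster and dividing by the cluster size equals twice the sum of squared deviations from the centroid.

```r
# Sketch: pairwise within-cluster variation vs. kmeans' tot.withinss
set.seed(1)
z <- matrix(rnorm(40), ncol = 2)
fit <- kmeans(z, centers = 2, nstart = 20)
pairwise.W <- function(xk) sum(as.matrix(dist(xk))^2) / nrow(xk)
W <- sapply(1:2, function(k) pairwise.W(z[fit$cluster == k, , drop = FALSE]))
sum(W)                # sum over clusters of the pairwise form
2 * fit$tot.withinss  # the same quantity, via the centroid identity
```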
Let’s consider a very basic simulated example – let’s simulate normal random variables with different means:
set.seed(2)
x = matrix(rnorm(50*2), ncol=2)
x[1:25,1] = x[1:25,1] + 3
x[1:25,2] = x[1:25,2] - 4
And let’s run k-means clustering, where we set nstart to 20, meaning that 20 random initial configurations are considered and the best one is reported:
km.out = kmeans(x,2,nstart=20)
plot(x, col=(km.out$cluster+ 1), main="K-Means Clustering Results with K=2", xlab="", ylab="", pch=20, cex=2)
So the algorithm was able to identify how we set up the data!
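To see why multiple starts matter, one can compare a single random initialization with nstart = 20 for a larger number of clusters, where local optima are more common (a hedged aside; on easy data the two may coincide):

```r
# With one random start, k-means may land in a poor local optimum;
# with 20 starts, the reported tot.withinss can only be as good or better.
set.seed(5)
km.single <- kmeans(x, 6, nstart = 1)
km.multi  <- kmeans(x, 6, nstart = 20)
c(km.single$tot.withinss, km.multi$tot.withinss)
```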
We analyze County Health Rankings in the US in 2013, based on a dataset from the University of Wisconsin Population Health Institute. I took this from a class I taught with Jim Guszcza, who put this together (it’s much better than what I had before :-).
Let’s load the data:
raw <- read.csv("countyHealthRR.csv")
dim(raw)
[1] 3141 33
raw[1:3,]
FIPS State County YPLL.Rate Perc.Fair.Poor.Health
1 1001 Alabama Autauga 8376 23
2 1003 Alabama Baldwin 7770 13
3 1005 Alabama Barbour 9458 22
Physically.Unhealthy.Days Mentally.Unhealthy.Days Perc.Low.Birth.Weight
1 5.1 4.3 9.4
2 3.3 3.8 8.8
3 5.0 4.1 12.3
Perc.Smokers Perc.Obese Perc.Physically.Inactive Perc.Excessive.Drinking
1 24 34 32 17
2 23 26 25 19
3 24 37 34 13
MV.Mortality.Rate Chlamydia.Rate Teen.Birth.Rate Perc.Uninsured
1 24 363 43 14
2 20 306 48 19
3 26 645 74 19
Pr.Care.Physician.Ratio Dentist.Ratio Prev.Hosp.Stay.Rate
1 2731 3935 71
2 1347 2379 56
3 2282 3398 109
Perc.Diabetic.Screening Perc.Mammography Perc.High.School.Grad
1 85 68.5 80
2 81 69.9 74
3 86 64.8 61
Perc.Some.College Perc.Unemployed Perc.Children.in.Poverty
1 54.4 8.0 21
2 62.1 8.1 21
3 42.1 11.4 40
Perc.No.Soc.Emo.Support Perc.Single.Parent.HH Violent.Crime.Rate
1 24 26 300
2 19 28 215
3 18 57 150
Avg.Daily.Particulates Perc.pop.in.viol Rec.Facility.Rate Perc.Limited.Access
1 13.3 0 7.3 10
2 11.8 8 8.7 5
3 12.1 8 3.7 11
Perc.Fast.Foods
1 52
2 35
3 56
library(psych)
Warning: package 'psych' was built under R version 3.5.2
describe(raw)[,1:5]
vars n mean sd median
FIPS 1 3141 30408.40 15152.78 29179.00
State* 2 3141 27.26 14.26 26.00
County* 3 3141 927.12 517.87 925.00
YPLL.Rate 4 2987 8060.13 2407.31 7727.00
Perc.Fair.Poor.Health 5 2743 16.63 5.69 16.00
Physically.Unhealthy.Days 6 2897 3.74 1.05 3.70
Mentally.Unhealthy.Days 7 2862 3.43 0.98 3.40
Perc.Low.Birth.Weight 8 2922 8.29 2.05 7.90
Perc.Smokers 9 2536 20.52 6.09 20.00
Perc.Obese 10 3141 30.39 4.17 31.00
Perc.Physically.Inactive 11 3141 27.90 5.16 28.00
Perc.Excessive.Drinking 12 2623 14.52 5.61 15.00
MV.Mortality.Rate 13 2773 22.06 10.50 20.00
Chlamydia.Rate 14 3135 309.33 270.86 232.00
Teen.Birth.Rate 15 2960 45.85 20.23 44.00
Perc.Uninsured 16 3140 18.57 5.61 18.00
Pr.Care.Physician.Ratio 17 2751 2486.47 1535.72 1987.00
Dentist.Ratio 18 2984 3245.48 1814.83 2753.50
Prev.Hosp.Stay.Rate 19 3011 78.75 31.27 73.00
Perc.Diabetic.Screening 20 3094 83.64 6.78 85.00
Perc.Mammography 21 3055 63.08 8.39 63.60
Perc.High.School.Grad 22 3107 82.63 10.03 84.00
Perc.Some.College 23 3141 54.17 11.99 54.20
Perc.Unemployed 24 3140 8.57 2.99 8.35
Perc.Children.in.Poverty 25 3140 24.74 9.02 24.00
Perc.No.Soc.Emo.Support 26 2471 19.39 5.44 19.00
Perc.Single.Parent.HH 27 3137 30.99 10.20 30.00
Violent.Crime.Rate 28 2966 270.89 226.59 211.00
Avg.Daily.Particulates 29 3107 11.08 1.81 11.20
Perc.pop.in.viol 30 3084 9.62 20.11 0.00
Rec.Facility.Rate 31 3141 7.49 7.59 6.70
Perc.Limited.Access 32 3141 8.37 8.22 6.00
Perc.Fast.Foods 33 2989 45.48 13.72 47.00
So it’s a large dataset, and the first three columns are indices. The remaining columns provide rather detailed information for each county. Let’s visualize:
library(corrplot)
corrplot 0.84 loaded
rho <- cor(raw[,-(1:3)], use="pairwise.complete.obs")
col3 <- colorRampPalette(c("red", "white", "blue"))
corrplot(rho, tl.cex=.7, order="hclust", col=col3(50))
corrplot(rho, order="hclust", method="shade", col=col3(50), tl.cex=.7)
So there is quite a bit of correlation, but nothing seems to be “replicated.” Unfortunately, we have a bunch of missing data. Let’s see what omitting incomplete rows would cost:
dat <- raw
dd <- describe(dat)[,1:5]
dd
vars n mean sd median
FIPS 1 3141 30408.40 15152.78 29179.00
State* 2 3141 27.26 14.26 26.00
County* 3 3141 927.12 517.87 925.00
YPLL.Rate 4 2987 8060.13 2407.31 7727.00
Perc.Fair.Poor.Health 5 2743 16.63 5.69 16.00
Physically.Unhealthy.Days 6 2897 3.74 1.05 3.70
Mentally.Unhealthy.Days 7 2862 3.43 0.98 3.40
Perc.Low.Birth.Weight 8 2922 8.29 2.05 7.90
Perc.Smokers 9 2536 20.52 6.09 20.00
Perc.Obese 10 3141 30.39 4.17 31.00
Perc.Physically.Inactive 11 3141 27.90 5.16 28.00
Perc.Excessive.Drinking 12 2623 14.52 5.61 15.00
MV.Mortality.Rate 13 2773 22.06 10.50 20.00
Chlamydia.Rate 14 3135 309.33 270.86 232.00
Teen.Birth.Rate 15 2960 45.85 20.23 44.00
Perc.Uninsured 16 3140 18.57 5.61 18.00
Pr.Care.Physician.Ratio 17 2751 2486.47 1535.72 1987.00
Dentist.Ratio 18 2984 3245.48 1814.83 2753.50
Prev.Hosp.Stay.Rate 19 3011 78.75 31.27 73.00
Perc.Diabetic.Screening 20 3094 83.64 6.78 85.00
Perc.Mammography 21 3055 63.08 8.39 63.60
Perc.High.School.Grad 22 3107 82.63 10.03 84.00
Perc.Some.College 23 3141 54.17 11.99 54.20
Perc.Unemployed 24 3140 8.57 2.99 8.35
Perc.Children.in.Poverty 25 3140 24.74 9.02 24.00
Perc.No.Soc.Emo.Support 26 2471 19.39 5.44 19.00
Perc.Single.Parent.HH 27 3137 30.99 10.20 30.00
Violent.Crime.Rate 28 2966 270.89 226.59 211.00
Avg.Daily.Particulates 29 3107 11.08 1.81 11.20
Perc.pop.in.viol 30 3084 9.62 20.11 0.00
Rec.Facility.Rate 31 3141 7.49 7.59 6.70
Perc.Limited.Access 32 3141 8.37 8.22 6.00
Perc.Fast.Foods 33 2989 45.48 13.72 47.00
dim(dat)
[1] 3141 33
dim(na.omit(dat))
[1] 1740 33
So unfortunately we would be dropping a lot of data. Hence, instead, let’s drop the few variables that have a high missing-data rate.
keep <- dd$var[dd$n > 2800]
names(dat)
[1] "FIPS" "State"
[3] "County" "YPLL.Rate"
[5] "Perc.Fair.Poor.Health" "Physically.Unhealthy.Days"
[7] "Mentally.Unhealthy.Days" "Perc.Low.Birth.Weight"
[9] "Perc.Smokers" "Perc.Obese"
[11] "Perc.Physically.Inactive" "Perc.Excessive.Drinking"
[13] "MV.Mortality.Rate" "Chlamydia.Rate"
[15] "Teen.Birth.Rate" "Perc.Uninsured"
[17] "Pr.Care.Physician.Ratio" "Dentist.Ratio"
[19] "Prev.Hosp.Stay.Rate" "Perc.Diabetic.Screening"
[21] "Perc.Mammography" "Perc.High.School.Grad"
[23] "Perc.Some.College" "Perc.Unemployed"
[25] "Perc.Children.in.Poverty" "Perc.No.Soc.Emo.Support"
[27] "Perc.Single.Parent.HH" "Violent.Crime.Rate"
[29] "Avg.Daily.Particulates" "Perc.pop.in.viol"
[31] "Rec.Facility.Rate" "Perc.Limited.Access"
[33] "Perc.Fast.Foods"
names(dat)[keep]
[1] "FIPS" "State"
[3] "County" "YPLL.Rate"
[5] "Physically.Unhealthy.Days" "Mentally.Unhealthy.Days"
[7] "Perc.Low.Birth.Weight" "Perc.Obese"
[9] "Perc.Physically.Inactive" "Chlamydia.Rate"
[11] "Teen.Birth.Rate" "Perc.Uninsured"
[13] "Dentist.Ratio" "Prev.Hosp.Stay.Rate"
[15] "Perc.Diabetic.Screening" "Perc.Mammography"
[17] "Perc.High.School.Grad" "Perc.Some.College"
[19] "Perc.Unemployed" "Perc.Children.in.Poverty"
[21] "Perc.Single.Parent.HH" "Violent.Crime.Rate"
[23] "Avg.Daily.Particulates" "Perc.pop.in.viol"
[25] "Rec.Facility.Rate" "Perc.Limited.Access"
[27] "Perc.Fast.Foods"
names(dat)[-keep]
[1] "Perc.Fair.Poor.Health" "Perc.Smokers"
[3] "Perc.Excessive.Drinking" "MV.Mortality.Rate"
[5] "Pr.Care.Physician.Ratio" "Perc.No.Soc.Emo.Support"
dim(na.omit(dat[,keep]))
[1] 2387 27
Better. Let’s take a look at the new set:
dat <- dat[,keep]
dim(dat)
[1] 3141 27
describe(dat)[,1:5]
vars n mean sd median
FIPS 1 3141 30408.40 15152.78 29179.00
State* 2 3141 27.26 14.26 26.00
County* 3 3141 927.12 517.87 925.00
YPLL.Rate 4 2987 8060.13 2407.31 7727.00
Physically.Unhealthy.Days 5 2897 3.74 1.05 3.70
Mentally.Unhealthy.Days 6 2862 3.43 0.98 3.40
Perc.Low.Birth.Weight 7 2922 8.29 2.05 7.90
Perc.Obese 8 3141 30.39 4.17 31.00
Perc.Physically.Inactive 9 3141 27.90 5.16 28.00
Chlamydia.Rate 10 3135 309.33 270.86 232.00
Teen.Birth.Rate 11 2960 45.85 20.23 44.00
Perc.Uninsured 12 3140 18.57 5.61 18.00
Dentist.Ratio 13 2984 3245.48 1814.83 2753.50
Prev.Hosp.Stay.Rate 14 3011 78.75 31.27 73.00
Perc.Diabetic.Screening 15 3094 83.64 6.78 85.00
Perc.Mammography 16 3055 63.08 8.39 63.60
Perc.High.School.Grad 17 3107 82.63 10.03 84.00
Perc.Some.College 18 3141 54.17 11.99 54.20
Perc.Unemployed 19 3140 8.57 2.99 8.35
Perc.Children.in.Poverty 20 3140 24.74 9.02 24.00
Perc.Single.Parent.HH 21 3137 30.99 10.20 30.00
Violent.Crime.Rate 22 2966 270.89 226.59 211.00
Avg.Daily.Particulates 23 3107 11.08 1.81 11.20
Perc.pop.in.viol 24 3084 9.62 20.11 0.00
Rec.Facility.Rate 25 3141 7.49 7.59 6.70
Perc.Limited.Access 26 3141 8.37 8.22 6.00
Perc.Fast.Foods 27 2989 45.48 13.72 47.00
dim(dat)
[1] 3141 27
dat <- na.omit(dat)
dim(dat)
[1] 2387 27
And let’s throw out IDs:
ID <- dat[,1:3]
summary(ID)
FIPS State County
Min. : 1001 Texas : 127 Washington: 29
1st Qu.:18164 Georgia : 118 Franklin : 22
Median :29077 Kentucky : 100 Jefferson : 22
Mean :30021 Missouri : 95 Lincoln : 19
3rd Qu.:42092 North Carolina: 91 Jackson : 18
Max. :56045 Iowa : 87 Madison : 16
(Other) :1769 (Other) :2261
dat <- dat[,4:ncol(dat)]
And let’s scale the data so as to make sure we are comparing apples to apples:
dat <- scale(dat)
dat <- as.data.frame(dat)
dat[1:2,]
YPLL.Rate Physically.Unhealthy.Days Mentally.Unhealthy.Days
1 0.22195709 1.2741861 0.8531928
2 -0.05159446 -0.5000776 0.3187999
Perc.Low.Birth.Weight Perc.Obese Perc.Physically.Inactive Chlamydia.Rate
1 0.6386909 0.8490322 0.8371057 0.21310594
2 0.3279272 -1.0541393 -0.4860614 -0.02745043
Teen.Birth.Rate Perc.Uninsured Dentist.Ratio Prev.Hosp.Stay.Rate
1 -0.06711232 -0.7356113 0.4220123 -0.2013252
2 0.19654323 0.2310834 -0.4604586 -0.7004909
Perc.Diabetic.Screening Perc.Mammography Perc.High.School.Grad
1 0.1790685 0.6209114 -0.2315002
2 -0.4663623 0.8003722 -0.9040249
Perc.Some.College Perc.Unemployed Perc.Children.in.Poverty
1 -0.007452166 -0.2813093 -0.3888458
2 0.664958589 -0.2452099 -0.3888458
Perc.Single.Parent.HH Violent.Crime.Rate Avg.Daily.Particulates
1 -0.5712313 0.08219526 1.177996
2 -0.3510409 -0.30339135 0.335747
Perc.pop.in.viol Rec.Facility.Rate Perc.Limited.Access Perc.Fast.Foods
1 -0.47427608 -0.1350508 0.5922044 0.5055126
2 -0.03336395 0.0844166 -0.3700773 -0.8828594
describe(dat)[,1:4]
vars n mean sd
YPLL.Rate 1 2387 0 1
Physically.Unhealthy.Days 2 2387 0 1
Mentally.Unhealthy.Days 3 2387 0 1
Perc.Low.Birth.Weight 4 2387 0 1
Perc.Obese 5 2387 0 1
Perc.Physically.Inactive 6 2387 0 1
Chlamydia.Rate 7 2387 0 1
Teen.Birth.Rate 8 2387 0 1
Perc.Uninsured 9 2387 0 1
Dentist.Ratio 10 2387 0 1
Prev.Hosp.Stay.Rate 11 2387 0 1
Perc.Diabetic.Screening 12 2387 0 1
Perc.Mammography 13 2387 0 1
Perc.High.School.Grad 14 2387 0 1
Perc.Some.College 15 2387 0 1
Perc.Unemployed 16 2387 0 1
Perc.Children.in.Poverty 17 2387 0 1
Perc.Single.Parent.HH 18 2387 0 1
Violent.Crime.Rate 19 2387 0 1
Avg.Daily.Particulates 20 2387 0 1
Perc.pop.in.viol 21 2387 0 1
Rec.Facility.Rate 22 2387 0 1
Perc.Limited.Access 23 2387 0 1
Perc.Fast.Foods 24 2387 0 1
Let’s again look at the correlations:
corrplot(cor(dat), tl.cex=.7, order="hclust", col=col3(50))
As a heuristic for choosing an appropriate number of clusters, we evaluate how the total within-cluster sum of squares varies with the number of clusters:
ss <- function(x) sum( ( x-mean(x) )^2 )
wss <- NULL
wss[1] <- sum( apply(dat,2,ss) )
for (k in 2:10) {
temp <- kmeans(dat, k)
wss[k] <- sum(temp$withinss)
}
barplot(wss, col="dodgerblue", names.arg=1:length(wss)
, xlab="Number of Clusters (k)"
, ylab="Total Within Sum of Squares")
abline(h=0)
title("Within Sum-of-Squares Analysis", col.main="navy")
One function that provides a number of “knobs” for running the clustering algorithm is NbClust from the library NbClust (e.g., try NbClust(dat, min.nc=3, max.nc=6, method="kmeans")), but it takes a long time to run. So let’s use the more basic kmeans again, where we set the number of clusters to five:
k <- 5
set.seed(652)
km <- kmeans(dat, k)
clust.km <- km$cluster
One way of illustrating hierarchical clustering is a so-called dendrogram, which displays the hierarchical relationship between the observations (try to zoom):
dd <- dist(dat, method="euclidean")
# try different linkage methods; we keep Ward's method for the plot
hc1 <- hclust(dd, method="average")
hc1 <- hclust(dd, method="complete")
hc1 <- hclust(dd, method="ward.D")
plot(hc1, hang=-1)
rect.hclust(hc1, k=6, border="dodgerblue")
rect.hclust(hc1, k=5, border="blue")
rect.hclust(hc1, k=4, border="red")
rect.hclust(hc1, k=3, border="green")
hc1 <- hclust(dd, method="ward.D")
rect.hclust(hc1, k=5, border="dodgerblue")
clust.hc1 <- cutree(hc1,5)
To aid visualization, let’s relabel the clusters in order of average mortality:
reord <- function(cluster){
  # relabel clusters so that cluster 1 has the lowest average YPLL rate
  avg <- tapply(scored$YPLL.Rate, cluster, mean)
  ord <- order(avg)
  clus <- factor(cluster, levels=ord)
  levels(clus) <- 1:length(ord)
  return( as.numeric(as.character(clus)) )
}
scored <- dat
scored$clust.km <- reord(clust.km)
scored$clust.hc <- reord(clust.hc1)
Let’s check:
tapply(scored$YPLL.Rate, clust.km, mean)
1 2 3 4 5
-0.164393620 1.062612713 1.343340184 -0.007740001 -0.980619550
tapply(scored$YPLL.Rate, reord(clust.km), mean)
1 2 3 4 5
-0.980619550 -0.164393620 -0.007740001 1.062612713 1.343340184
table(clust.km, reord(clust.km))
clust.km 1 2 3 4 5
1 0 658 0 0 0
2 0 0 0 399 0
3 0 0 0 0 234
4 0 0 457 0 0
5 639 0 0 0 0
tapply(scored$YPLL.Rate, clust.hc1, mean)
1 2 3 4 5
0.6839265 -0.1262175 1.3095073 -1.0664764 -0.5250699
tapply(scored$YPLL.Rate, reord(clust.hc1), mean)
1 2 3 4 5
-1.0664764 -0.5250699 -0.1262175 0.6839265 1.3095073
table(clust.hc1, reord(clust.hc1))
clust.hc1 1 2 3 4 5
1 0 0 0 767 0
2 0 0 342 0 0
3 0 0 0 0 204
4 341 0 0 0 0
5 0 733 0 0 0
table(scored$clust.km, scored$clust.hc)
1 2 3 4 5
1 305 324 9 1 0
2 5 368 62 223 0
3 31 35 268 114 9
4 0 6 0 380 13
5 0 0 3 49 182
So the two algorithms produce similar – but far from equivalent – results.
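One way to quantify this agreement with a single number is the adjusted Rand index (1 for identical partitions, about 0 for chance-level agreement); this sketch assumes the mclust package is installed:

```r
# Adjusted Rand index between the k-means and hierarchical partitions
library(mclust)  # assumes mclust is available
adjustedRandIndex(scored$clust.km, scored$clust.hc)
```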
Let’s perform a PCA:
pc1 <- prcomp(dat)
pc1 <- prcomp(scale(dat)) # equivalent here, since dat is already scaled
round(pc1$rotation[,1:2], 3)
PC1 PC2
YPLL.Rate -0.301 0.042
Physically.Unhealthy.Days -0.235 0.189
Mentally.Unhealthy.Days -0.190 0.122
Perc.Low.Birth.Weight -0.248 -0.158
Perc.Obese -0.222 0.126
Perc.Physically.Inactive -0.249 0.238
Chlamydia.Rate -0.179 -0.407
Teen.Birth.Rate -0.288 -0.051
Perc.Uninsured -0.206 -0.053
Dentist.Ratio -0.128 0.293
Prev.Hosp.Stay.Rate -0.216 0.273
Perc.Diabetic.Screening 0.136 0.064
Perc.Mammography 0.194 -0.160
Perc.High.School.Grad 0.165 0.295
Perc.Some.College 0.262 -0.195
Perc.Unemployed -0.193 -0.084
Perc.Children.in.Poverty -0.303 -0.073
Perc.Single.Parent.HH -0.246 -0.310
Violent.Crime.Rate -0.148 -0.407
Avg.Daily.Particulates -0.115 0.128
Perc.pop.in.viol -0.039 0.087
Rec.Facility.Rate 0.160 -0.175
Perc.Limited.Access -0.073 -0.167
Perc.Fast.Foods -0.128 -0.084
Let’s compute the PCs and check their correlation:
pcs <- predict(pc1)
describe(pcs)[,1:5]
vars n mean sd median
PC1 1 2387 0 2.87 0.18
PC2 2 2387 0 1.57 0.12
PC3 3 2387 0 1.39 0.27
PC4 4 2387 0 1.12 0.06
PC5 5 2387 0 1.09 -0.01
PC6 6 2387 0 0.97 -0.13
PC7 7 2387 0 0.91 -0.02
PC8 8 2387 0 0.87 -0.03
PC9 9 2387 0 0.85 -0.01
PC10 10 2387 0 0.84 -0.02
PC11 11 2387 0 0.79 -0.05
PC12 12 2387 0 0.76 0.03
PC13 13 2387 0 0.71 -0.01
PC14 14 2387 0 0.68 0.03
PC15 15 2387 0 0.67 0.01
PC16 16 2387 0 0.65 0.00
PC17 17 2387 0 0.58 -0.01
PC18 18 2387 0 0.55 0.00
PC19 19 2387 0 0.54 -0.01
PC20 20 2387 0 0.50 0.00
PC21 21 2387 0 0.48 -0.01
PC22 22 2387 0 0.45 0.02
PC23 23 2387 0 0.43 0.01
PC24 24 2387 0 0.39 0.00
dim(pcs); dim(dat)
[1] 2387 24
[1] 2387 24
corrplot(cor(pcs))
So as expected, the PCs are uncorrelated.
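We can also confirm this numerically: the largest off-diagonal correlation between the scores should be zero up to floating-point error.

```r
# Largest absolute off-diagonal correlation among the PC scores
rho.pc <- cor(pcs)
max(abs(rho.pc[upper.tri(rho.pc)]))  # essentially zero
```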
Let’s prepare a scree plot:
vars <- apply(pcs, 2, var)
sum(vars); ncol(dat); ncol(pcs)
[1] 24
[1] 24
[1] 24
barplot(vars[1:10], col="lightblue", ylab="variance", las=1)
title("Principal Components Analysis Scree Plot", col.main="navy")
abline(h=1:7, col="darkcyan")
abline(h=0)
plot(pc1)
summary(pc1)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.8682 1.5704 1.39358 1.1225 1.08510 0.96611 0.91362
Proportion of Variance 0.3428 0.1027 0.08092 0.0525 0.04906 0.03889 0.03478
Cumulative Proportion 0.3428 0.4455 0.52643 0.5789 0.62799 0.66688 0.70166
PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 0.87179 0.8471 0.83583 0.79341 0.76144 0.70866 0.6752
Proportion of Variance 0.03167 0.0299 0.02911 0.02623 0.02416 0.02092 0.0190
Cumulative Proportion 0.73333 0.7632 0.79233 0.81856 0.84272 0.86365 0.8826
PC15 PC16 PC17 PC18 PC19 PC20 PC21
Standard deviation 0.66962 0.64511 0.58329 0.55006 0.53940 0.50234 0.47829
Proportion of Variance 0.01868 0.01734 0.01418 0.01261 0.01212 0.01051 0.00953
Cumulative Proportion 0.90132 0.91866 0.93284 0.94545 0.95757 0.96809 0.97762
PC22 PC23 PC24
Standard deviation 0.44520 0.43295 0.38929
Proportion of Variance 0.00826 0.00781 0.00631
Cumulative Proportion 0.98588 0.99369 1.00000
So the first PC seems to explain a lot of the variation. We can illustrate this via a so-called biplot:
biplot(pc1, col=c("slategrey", "navy"), cex=c(.2, .8))
round(pc1$rotation, 4)[,1:2]
PC1 PC2
YPLL.Rate -0.3011 0.0418
Physically.Unhealthy.Days -0.2354 0.1891
Mentally.Unhealthy.Days -0.1896 0.1225
Perc.Low.Birth.Weight -0.2480 -0.1581
Perc.Obese -0.2217 0.1260
Perc.Physically.Inactive -0.2486 0.2385
Chlamydia.Rate -0.1792 -0.4068
Teen.Birth.Rate -0.2879 -0.0511
Perc.Uninsured -0.2056 -0.0526
Dentist.Ratio -0.1277 0.2932
Prev.Hosp.Stay.Rate -0.2163 0.2732
Perc.Diabetic.Screening 0.1359 0.0640
Perc.Mammography 0.1936 -0.1602
Perc.High.School.Grad 0.1648 0.2953
Perc.Some.College 0.2622 -0.1952
Perc.Unemployed -0.1933 -0.0842
Perc.Children.in.Poverty -0.3026 -0.0726
Perc.Single.Parent.HH -0.2457 -0.3103
Violent.Crime.Rate -0.1479 -0.4066
Avg.Daily.Particulates -0.1146 0.1276
Perc.pop.in.viol -0.0391 0.0866
Rec.Facility.Rate 0.1605 -0.1752
Perc.Limited.Access -0.0726 -0.1669
Perc.Fast.Foods -0.1279 -0.0835
We can visualize our clusters in PC space:
col <- c("blue","dodgerblue","lightgreen","pink","red")
par(mfrow=c(1,2))
clust <- scored$clust.km
plot(pcs, type="n", main="k-means")
text(pcs, labels=clust, col=col[clust])
abline(h=0); abline(v=0)
clust <- scored$clust.hc
plot(pcs, type="n", main="hierarchical clustering")
text(pcs, labels=clust, col=col[clust])
abline(h=0); abline(v=0)
What do you think?
Let’s consider aggregate statistics for each variable by cluster:
col <- c("blue", "dodgerblue", "lightgreen", "pink", "red", "maroon", "darkorange")
clust <- scored[, "clust.km"]
agg <- function(x) tapply(x, clust, mean)
summ <- apply(dat, 2, agg)
t(round(summ,2))
1 2 3 4 5
YPLL.Rate -0.98 -0.16 -0.01 1.06 1.34
Physically.Unhealthy.Days -0.84 -0.09 -0.04 1.28 0.42
Mentally.Unhealthy.Days -0.71 -0.06 0.03 1.04 0.27
Perc.Low.Birth.Weight -0.74 -0.24 0.09 0.43 1.80
Perc.Obese -0.73 0.13 -0.33 0.59 1.26
Perc.Physically.Inactive -0.84 0.16 -0.43 1.03 0.94
Chlamydia.Rate -0.47 -0.41 0.48 -0.24 1.91
Teen.Birth.Rate -1.02 -0.22 0.34 0.90 1.21
Perc.Uninsured -0.88 -0.12 0.53 0.63 0.63
Dentist.Ratio -0.47 0.26 -0.47 0.63 0.41
Prev.Hosp.Stay.Rate -0.64 -0.01 -0.41 1.18 0.55
Perc.Diabetic.Screening 0.43 0.12 -0.12 -0.35 -0.68
Perc.Mammography 0.67 -0.01 0.07 -0.86 -0.47
Perc.High.School.Grad 0.56 0.37 -0.63 -0.11 -1.16
Perc.Some.College 1.04 -0.17 0.14 -1.02 -0.88
Perc.Unemployed -0.74 -0.08 0.32 0.35 1.04
Perc.Children.in.Poverty -1.03 -0.21 0.29 0.81 1.48
Perc.Single.Parent.HH -0.78 -0.31 0.47 0.13 1.86
Violent.Crime.Rate -0.48 -0.38 0.69 -0.19 1.34
Avg.Daily.Particulates -0.51 0.22 -0.22 0.47 0.39
Perc.pop.in.viol -0.09 -0.04 -0.08 0.32 -0.01
Rec.Facility.Rate 0.63 -0.15 0.11 -0.59 -0.50
Perc.Limited.Access -0.21 -0.25 0.48 -0.07 0.47
Perc.Fast.Foods -0.39 -0.18 0.13 0.39 0.65
And let’s include the PCs:
scored2 <- dat
scored2$PC1 <- pcs[,1]
scored2$PC2 <- pcs[,2]
scored2$PC3 <- pcs[,3]
clust <- scored[, "clust.km"]
agg <- function(x) tapply(x, clust, mean)
SUMMARY <- apply(scored2, 2, agg)
t(round(SUMMARY,2))
1 2 3 4 5
YPLL.Rate -0.98 -0.16 -0.01 1.06 1.34
Physically.Unhealthy.Days -0.84 -0.09 -0.04 1.28 0.42
Mentally.Unhealthy.Days -0.71 -0.06 0.03 1.04 0.27
Perc.Low.Birth.Weight -0.74 -0.24 0.09 0.43 1.80
Perc.Obese -0.73 0.13 -0.33 0.59 1.26
Perc.Physically.Inactive -0.84 0.16 -0.43 1.03 0.94
Chlamydia.Rate -0.47 -0.41 0.48 -0.24 1.91
Teen.Birth.Rate -1.02 -0.22 0.34 0.90 1.21
Perc.Uninsured -0.88 -0.12 0.53 0.63 0.63
Dentist.Ratio -0.47 0.26 -0.47 0.63 0.41
Prev.Hosp.Stay.Rate -0.64 -0.01 -0.41 1.18 0.55
Perc.Diabetic.Screening 0.43 0.12 -0.12 -0.35 -0.68
Perc.Mammography 0.67 -0.01 0.07 -0.86 -0.47
Perc.High.School.Grad 0.56 0.37 -0.63 -0.11 -1.16
Perc.Some.College 1.04 -0.17 0.14 -1.02 -0.88
Perc.Unemployed -0.74 -0.08 0.32 0.35 1.04
Perc.Children.in.Poverty -1.03 -0.21 0.29 0.81 1.48
Perc.Single.Parent.HH -0.78 -0.31 0.47 0.13 1.86
Violent.Crime.Rate -0.48 -0.38 0.69 -0.19 1.34
Avg.Daily.Particulates -0.51 0.22 -0.22 0.47 0.39
Perc.pop.in.viol -0.09 -0.04 -0.08 0.32 -0.01
Rec.Facility.Rate 0.63 -0.15 0.11 -0.59 -0.50
Perc.Limited.Access -0.21 -0.25 0.48 -0.07 0.47
Perc.Fast.Foods -0.39 -0.18 0.13 0.39 0.65
PC1 3.44 0.44 -0.42 -3.02 -4.65
PC2 -0.14 0.85 -1.51 1.59 -1.79
PC3 0.02 0.14 -0.37 -0.11 0.44
Let’s visualize:
names(dat)
[1] "YPLL.Rate" "Physically.Unhealthy.Days"
[3] "Mentally.Unhealthy.Days" "Perc.Low.Birth.Weight"
[5] "Perc.Obese" "Perc.Physically.Inactive"
[7] "Chlamydia.Rate" "Teen.Birth.Rate"
[9] "Perc.Uninsured" "Dentist.Ratio"
[11] "Prev.Hosp.Stay.Rate" "Perc.Diabetic.Screening"
[13] "Perc.Mammography" "Perc.High.School.Grad"
[15] "Perc.Some.College" "Perc.Unemployed"
[17] "Perc.Children.in.Poverty" "Perc.Single.Parent.HH"
[19] "Violent.Crime.Rate" "Avg.Daily.Particulates"
[21] "Perc.pop.in.viol" "Rec.Facility.Rate"
[23] "Perc.Limited.Access" "Perc.Fast.Foods"
# pick a block of variables to display (the last assignment wins)
show <- 1:8
show <- 9:16
show <- 17:22
par(mfrow=c(2,4))
for(i in show){
boxplot( scored[,i] ~ clust, col=col, varwidth=TRUE)
abline(h=0, col="navy")
title(names(scored)[i])
}
So there is quite some variation. To make sense of all this, let’s visualize the clusters on a map, where we align our data with the map polygons by matching FIPS codes (using the county.fips dataset that ships with the maps package):
library(maps)
library(mapproj)
par(mfrow=c(1,1))
data(county.fips)
head(county.fips)
fips polyname
1 1001 alabama,autauga
2 1003 alabama,baldwin
3 1005 alabama,barbour
4 1007 alabama,bibb
5 1009 alabama,blount
6 1011 alabama,bullock
matches <- clust[match(county.fips$fips, ID$FIPS)]
map("county", col = col[matches], fill = TRUE, resolution = 0,
lty = 0, projection = "polyconic")
map("state", col = "white", fill = FALSE, add = TRUE, lty = 1, lwd = 0.2,
projection="polyconic")
title("County Cluster Assignments Based on Population Health Data", col.main="navy")
txt <- as.character(1:length(unique(clust)))
legend("bottomleft", txt, horiz=TRUE, fill=col)
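Note that counties we dropped via na.omit (and any FIPS codes without a match) have no cluster assignment, so their polygons are left unfilled; a quick check of how many map polygons are affected:

```r
# Map polygons without a cluster assignment (NA in the lookup)
sum(is.na(matches))
```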